Abstract. Network data is ubiquitous and growing, yet we lack realistic generative network
models that can be calibrated to match real-world data. The recently proposed Block Two-Level
Erdős-Rényi (BTER) model can be tuned to capture two fundamental properties: degree distribution
and clustering coefficients. The latter is particularly important for reproducing graphs with
community structure, such as social networks. In this paper, we compare BTER to other scalable
models and show that it gives a better fit to real data. We provide a scalable implementation that
requires only $O(d_{\max})$ storage, where $d_{\max}$ is the maximum number of neighbors for a single node.
The generator is trivially parallelizable, and we show results for a Hadoop implementation
modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can
be used as a graph generator for benchmarking purposes and provide idealized degree distributions
and clustering coefficient profiles that can be tuned for user specifications.
1. Introduction. Unprecedented amounts of network interaction data are now
available from online social networks (Facebook, Twitter), massive multi-player online
games (World of Warcraft, Everquest), computer-to-computer communications,
financial transactions, instant messaging, and more. As a result, models, algorithms,
software, and hardware for large-scale graph analysis are struggling to keep pace with
increasing demands for scalability and relevance. For instance, we lack models that
provide a realistic baseline for statistical analysis such as anomaly detection. Additionally,
a major obstacle to working in the field of network analysis is that data is
naturally restricted due to a combination of security and privacy concerns. For these
reasons, we need scalable generative models for networks.

Ideally, a generative model can be calibrated to match real world data. For the 
purposes of this paper, we consider the two most fundamental properties of graphs: 
the degree distribution and the clustering coefficients [40]. Let $G = (V, E)$ be an
undirected, unweighted graph on vertices $V$ defined by edges in $E$; let $n = |V|$ denote
the number of nodes, and let $m = |E|$ denote the number of edges.

In most real-world networks representing interaction data, there are a few nodes 
with high degrees and many nodes with low degrees, with a smooth transition between. 
In other words, the degree distribution is heavy-tailed, and this feature has long been 
considered as the critical feature that distinguishes real networks from arbitrary sparse 
networks [2, 12, 35]. A realistic generative model should be able to reproduce the 
degree distribution, which specifies the number of nodes of each degree, $\{n_d\}$.

However, the structure prescribed by the degree distribution is only part of the
story; real-world graphs have an abundance of triangles. In social networks, triangles
indicate social cohesion: two people with a common friend are much more likely to be
friends themselves. Figure 1.1 shows a wedge centered at node 1, i.e., the path 2-1-3
is a wedge. If edge 2-3 exists, then the wedge is closed (forms a triangle); otherwise
it is open. Note that every triangle corresponds to three closed wedges, so the number
of closed wedges is three times the number of triangles. To measure this, we define the
global clustering coefficient (GCC) [41, 3] as

$$\text{GCC} = \frac{\#\text{ of closed wedges}}{\#\text{ of wedges}}.$$

Fig. 1.1: Wedge centered at node 1.

We are specifically interested in the clustering coefficient per degree [41], defined as

$$c_d = \frac{\#\text{ of closed wedges centered at a node of degree } d}{\#\text{ of wedges centered at a node of degree } d}.$$

The clustering coefficients of real-world graphs are much higher than those of random
graphs with the same degree distribution [17]. It is known that real-world graphs
have significant clustering coefficients and that these are important to community
structure. Nonetheless, most generative models fail to match clustering coefficients of
real-world graphs [34].
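
As a small illustration of these definitions, the following Python sketch enumerates every wedge of a toy graph and reports the GCC and the per-degree clustering coefficients. The function name `clustering_by_degree` and the edge-list input format are illustrative assumptions, and exhaustive wedge enumeration of this kind is only practical for small graphs.

```python
from collections import defaultdict
from itertools import combinations


def clustering_by_degree(edges):
    """Return (gcc, {d: c_d}) for an undirected simple graph given as edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    wedges = defaultdict(int)           # wedges centered at nodes of degree d
    closed = defaultdict(int)           # closed wedges centered at nodes of degree d
    for node in list(adj):
        nbrs = adj[node]
        d = len(nbrs)
        for a, b in combinations(nbrs, 2):   # every pair of neighbors is a wedge
            wedges[d] += 1
            if b in adj[a]:                  # the wedge is closed by edge a-b
                closed[d] += 1

    total = sum(wedges.values())
    gcc = sum(closed.values()) / total if total else 0.0
    per_degree = {d: closed[d] / wedges[d] for d in wedges}
    return gcc, per_degree


# Toy graph: triangle 1-2-3 plus a pendant edge 3-4.
gcc, c_d = clustering_by_degree([(1, 2), (1, 3), (2, 3), (3, 4)])
```

In the toy graph, three of the five wedges are closed, so the GCC is 0.6; both degree-2 nodes have coefficient 1, and the single degree-3 node has coefficient 1/3. Degree-1 nodes center no wedges and therefore have no entry.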

1.1. Contributions. The Block Two-Level Erdős-Rényi (BTER) model [36] has been
proposed as a model that can be tuned to capture both the degree distribution and
degree-wise clustering coefficients of real-world networks. The goal of this paper is to
focus on the implementation and scalability of this model. Hence, this paper makes
the following contributions:

• We describe a scalable implementation of the BTER generative graph model
that uses efficient data structures. Our reference implementation requires working
storage of at most $10 \cdot d_{\max}$ values, where $d_{\max} \ll n$ is the largest degree. The generation
cost per edge is $O(\log d_{\max})$. Our approach generates all edges independently and thus
can be easily parallelized. Moreover, the edges can be generated in an arbitrary order,
so the BTER generative model can also be used in streaming scenarios.

• We demonstrate that BTER can accurately recreate the degree distributions 
and triangle behavior of large real-world graphs. We show results for several example
graphs from the Laboratory for Web Algorithms [43], including a graph with over 130 
million nodes and 4.6 billion edges. We also compare BTER to competing methods 
on a pair of smaller graphs. 

• Finally, we propose BTER as a standard graph generator for benchmarking 
purposes. Since BTER can work with arbitrary degree distributions, we propose 
an "ideal" degree distribution: discrete generalized log-normal (DGLN). This is a 
two-parameter model that can easily match a desired maximum possible degree (an 
absolute bound) and a desired average degree. We also propose a semilog-linear model 
for clustering coefficients. 

We note that there is no fitting step for BTER: it takes the target degree
distribution and clustering coefficients per degree directly as input. The clustering 
coefficients can be expensive to compute, but we have recently proposed a sampling 
method that scales to very large graphs [38, 20]. 
1.2. Related Work. Since the goal of this paper is to focus on the implementation
and scalability of BTER, we limit our discussion to the most salient related
models. A more thorough discussion of related work can be found in the original
paper on BTER [36].

The majority of graph models add edges one at a time in a way that each random 
edge influences the formation of future edges, making them inherently unscalable. 
The classic example is Preferential Attachment [2], but there are a variety of related 
models, e.g., [21, 24]. These models are more focused on capturing qualitative properties
of graphs and typically are difficult to match to real-world data [34]. Perhaps
the most relevant is [19], which creates a graph with power law degree distribution
and then "rewires" it to improve the clustering coefficients. Another related model,
the clustering model proposed by Newman [31], assigns "individuals" to "groups" (a
bipartite graph with individual and group nodes) and then creates a graph of connections
between individuals by assigning connection probabilities to each group; in
other words, each group is modeled as an Erdős-Rényi graph.

A widely used model for large-scale graphs is the Stochastic Kronecker
Graph (SKG) model, also known as R-MAT [8, 23]. The generation process is easily 
parallelized and can scale to very large graphs. Notably, SKG has been selected as 
the generator for the Graph 500 Supercomputer Benchmark [42] and has been used 
in a variety of other studies [13, 26, 29, 28, 18, 14, 27]. Unfortunately, SKG has some 
drawbacks. (1) SKG can be extremely expensive to fit to real data (using KronFit), 
and even then the fit is imperfect [23]. (2) It can generate only lognormal tails (after 
suitable addition of random noise) [39, 37], limiting the degree distributions that it can 
capture. (3) Most importantly, SKG rarely closes wedges so the clustering coefficients 
are much smaller than what is produced in real data [34, 20]. 

Another model of relevance is the Chung-Lu (CL) model [10, 11, 1]. It is very
similar to the edge-configuration model of Newman et al. [30]. Let $d_i$ denote the
desired degree for node $i$. In the CL model, the probability of an edge is proportional
to the degrees of its endpoints, i.e., the probability of edge $(i, j)$ is $\propto d_i d_j$. Edges
can be generated independently by picking endpoints proportional to their desired
degrees. If all degrees are the same, CL reduces to the well-known Erdős-Rényi
model [15]. The CL model is often used as a null model; for example, it is the basis of
the modularity metric [32]. Graphs generated by the CL model and the SKG model
are, in fact, very similar [33]. The advantage of the CL model is that it can be better
tuned to real-world degree distributions. The disadvantage of the model is that, like
SKG, it rarely closes wedges. CL is a special case of BTER that skips Phase 1 (see
§2), and the CL construction is a very important part of BTER.
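
The independent edge generation of the CL model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than any published implementation: the function name `chung_lu_edges`, the choice of inserting $\lfloor \sum_i d_i / 2 \rfloor$ edges, and the decision to drop self-loops and duplicates are ours.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def chung_lu_edges(degrees, seed=0):
    """Sample floor(sum(degrees)/2) edges with endpoints proportional to degree."""
    rng = random.Random(seed)
    cum = list(accumulate(degrees))     # cumulative desired degrees
    total = cum[-1]

    def pick():
        # Binary search on the cumulative weights: node i is chosen with
        # probability degrees[i] / total.
        return bisect_right(cum, rng.random() * total)

    edges = set()
    for _ in range(total // 2):
        i, j = pick(), pick()           # endpoints are chosen independently
        if i != j:                      # assumption: self-loops are dropped
            edges.add((min(i, j), max(i, j)))   # the set discards duplicates
    return edges


edges = chung_lu_edges([3, 2, 2, 1, 1, 1])
```

Because self-loops and duplicates are discarded, the realized number of edges can be smaller than the number of insertions, which is one reason the realized degrees only approximate the desired ones.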

2. BTER Generative Graph Model.

2.1. The BTER Model. Our earlier work argued that a graph with a heavy-tailed
degree distribution and high clustering coefficients must contain Erdős-Rényi
blocks of densely connected vertices; moreover, the distribution of the sizes of those
blocks follows the same type of distribution as the degrees (e.g., power law) [36]. Based
on this premise, the BTER model [36] organizes nodes into affinity blocks such that
nodes within the same affinity block have a much higher chance of being connected
than nodes at random, but BTER also behaves like the CL model in that it is able
to match an arbitrary degree distribution.

We first give a high-level description of a serial version of BTER [36]. The BTER
model requires two user-specified inputs: (1) the desired degree distribution, denoted by
$\{n_d\}$, where $n_d$ is the number of nodes of degree $d$; and (2) the clustering coefficient by
degree, denoted by $\{c_d\}$, where $c_d$ is the desired mean clustering coefficient for nodes
of degree $d$. (We note that the original description in [36] did not take the
clustering coefficients as input but rather a function to determine the connectivity.) The
desired numbers of nodes and edges are $n = \sum_d n_d$ and $m = \frac{1}{2} \sum_d d \cdot n_d$, respectively.
There are three main steps to the graph generation, as described below. The steps
are depicted in Figure 2.1.

Preprocessing. Imagine starting with $n$ isolated vertices. Each vertex is assigned
a degree, based on the degree distribution $\{n_d\}$. So we arbitrarily assign $n_1$ vertices
to have degree 1, $n_2$ to have degree 2, etc. We then partition the $n$ vertices into
affinity blocks. In general, an affinity block contains $d + 1$ vertices that are assigned
degree $d$, where $d$ varies over the entire range. Note that there are many small blocks
with vertices of low degree, and a few large blocks of high degree vertices. At this
stage, no edges have been added.

Phase 1. This phase adds edges within each affinity block. Each block is a dense
Erdős-Rényi graph, where the density depends on the size of the block. For a block
involving degree-$d$ vertices, the density is chosen based on $c_d$ (the observed clustering
coefficient). All the parameters are chosen to ensure that each vertex achieves its
desired clustering coefficient and is not incident to more edges than its desired degree.

Phase 2. This phase adds edges between the blocks. Consider some vertex $i$ with
an assigned degree $d_i$. Suppose it is already incident to $d'_i$ edges from Phase 1. We set
$e_i = d_i - d'_i$ to be the excess degree of $i$. We must create edges incident to vertex $i$ to
satisfy its degree requirement. We construct a Chung-Lu graph with degree sequence
$e_1, e_2, \ldots, e_n$ to complete the graph construction.





Fig. 2.1: BTER model phases [36]. (a) Preprocessing: distribution of nodes into affinity
blocks. (b) Phase 1: local links within each affinity block. (c) Phase 2: global links
across affinity blocks.



2.2. Developing a scalable implementation. Our goal is to show that it 
is possible to have a highly scalable implementation of the BTER method. The 
main goal is to have independent edge insertions so that the edge generation can be 
parallelized. 

As stated, Phase 2 edge insertions must happen after Phase 1, because we need
to know the excess degrees. We parallelize this process by computing the expected 
excess degree. Given all the input parameters, we can precompute the expected excess 
degree for any vertex (this requires the compact representations and data structures) 
during the preprocessing. From this, we can precompute the total number of Phase 
1 and Phase 2 edges. 

To perform a parallel edge insertion, we first decide randomly whether this should 
be a Phase 1 or Phase 2 edge. For a Phase 1 edge, we select a random affinity block 
(with the appropriate probability) and connect two uniform members of the block. 
For a Phase 2 edge, we perform a Chung-Lu edge insertion based on expected excess
degrees. Because every edge is generated independently, there will be duplicates, but
these are discarded in the final graph.

Given the structure of parallel edge insertion, the main challenges in developing 
a scalable implementation are as follows: 

• Preprocessing data structures. A naive implementation of the preprocessing
  step would require $O(n)$ variables and storage. We design compact
  representations and data structures for the affinity blocks. These contain all
  the relevant information with minimal storage.

• Repeats in Phase 1. A direct parallel implementation of Phase 1 leads to 
many repeated edges, and this affects the overall degree distribution when 
edge repeats are removed. We model this problem as a coupon collector 
problem and give formulas for the number of extra Phase 1 edges to be inserted 
to rectify this. 

• The degree-1 problem of Phase 2. A parallel implementation of Phase
  2 results in numerous degree-1 vertices becoming isolated. We use a fix for
  this proposed in [14], which is different from the one proposed in the original
  BTER paper [36].

Once these issues are addressed, we have an embarrassingly parallel edge generation
algorithm that requires only $O(\log d_{\max})$ work per edge. The remainder of this section
gives an in-depth but informal presentation of our implementation. A detailed
algorithm specification is provided in the appendix.

2.3. Preprocessing. First, some notation. We let $d_{\max}$ denote the maximum
degree. We let $d_i$ denote the (desired) degree of vertex $i$. For convenience, the nodes
are indexed by increasing degree except for degree-1 nodes, which are indexed last.
Hence, if $d_i, d_j \ge 2$ and $i < j$, then $d_i \le d_j$. As an example, see the numbering in
Figure 2.2.

Fig. 2.2: Example of affinity blocks and groups (nodes shown in rows by degree, from
degree 1 to degree 9).



In the preprocessing phase, we assign nodes to affinity blocks. For the assignment
to affinity blocks, degree-1 nodes are ignored. The remaining nodes are assigned to
affinity blocks in order (of degree). A homogeneous affinity block has $d + 1$ nodes
of degree $d$. In Figure 2.2, the blocks are denoted by colored ovals, and 4-5-6 is a
homogeneous block. The vast majority of (low-degree) nodes are assigned to homogeneous
affinity blocks. However, there are not always enough nodes of degree $d$ to fill
in a homogeneous block; therefore, we also have a few (at most $d_{\max}$) heterogeneous
affinity blocks with nodes of different degrees. For instance, in Figure 2.2, 19-20-21 is
a heterogeneous block.
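
The in-order assignment can be illustrated with a naive Python sketch that lists every block explicitly. This is not the compact representation developed in this section: it uses $O(n)$ storage, and the function name `assign_blocks`, the tuple layout, the example degree counts, and the handling of a short final block are assumptions made for illustration.

```python
def assign_blocks(degree_counts):
    """Return blocks as (start_index, size, min_degree, is_homogeneous).

    degree_counts maps a degree d to the number of nodes n_d. Degree-1 nodes
    are ignored, and node indices are 1-based in order of increasing degree.
    """
    # degs[k - 1] is the degree of node k.
    degs = [d for d in sorted(degree_counts) if d >= 2
            for _ in range(degree_counts[d])]
    blocks, start = [], 0
    while start < len(degs):
        d_min = degs[start]                        # minimum degree in the block
        size = min(d_min + 1, len(degs) - start)   # assumption: last block may be short
        last_degree = degs[start + size - 1]
        homogeneous = (size == d_min + 1) and (last_degree == d_min)
        blocks.append((start + 1, size, d_min, homogeneous))
        start += size
    return blocks


# Hypothetical counts: 20 nodes of degree 2, 10 of degree 3, 6 of degree 4.
blocks = assign_blocks({2: 20, 3: 10, 4: 6})
```

With these hypothetical counts, nodes 4-5-6 form a homogeneous block of size 3, while the block starting at node 19 mixes two degree-2 nodes with one degree-3 node and is therefore heterogeneous.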

Observe that storing a single block requires only three pieces of information: the 
starting index, the block size, and the block weight (related to its desired connectivity, 
which is a function of block size and clustering coefficient for the minimum degree 
in the block). However, note further that all affinity blocks of the same size and 
minimum degree can be grouped together into an affinity group — all blocks in the 
same group share the same block size and weight. In Figure 2.2, all nodes with the 
same color are in the same affinity group, e.g., 1-21 are in the same affinity group, 
likewise nodes 22-33, etc. The information needed to store an affinity group boils 
down to 4 items of information: the starting index of the group, the number of blocks 
in the group, the size of each block, and the weight of each block. The maximum
number of groups is bounded by $d_{\max}$, so we need to store at most $4 \cdot d_{\max}$ values.

Phase 2 needs to know the expected excess degree of all nodes, which is the
difference between the desired degree and the number of expected links from Phase 1.
Again, a naive implementation would require $O(n)$ information, but most nodes of the
same degree behave the same. In a block where all nodes have the same degree, we
say the nodes are bulk nodes. In a block with nodes of differing degrees, all nodes with
degree equal to the minimum degree are still bulk nodes. We refer to the remaining
nodes as filler nodes. In Figure 2.2, nodes 1-20 are degree-2 bulk nodes, nodes 22-30
are degree-3 bulk nodes, nodes 34-36 are degree-4 bulk nodes, and so on. Node 21 is
a degree-3 filler node, nodes 31-33 are degree-4 filler nodes, etc. It is possible to have
either no bulk nodes or no filler nodes for a given degree. In Figure 2.2, there are no
filler degree-2 nodes and no bulk degree-6 nodes. Observe that all bulk nodes of degree $d$
(for any $d$) are in blocks of the same size and connectivity; therefore, they all have
the same excess degree. The filler nodes of degree $d$ (for any $d$) participate in at most
one block and so all have the same excess degree. This means that there are two
possible values for excess degree for the set of nodes with desired degree $d$. For each
degree $d$, Phase 2 needs just 5 values: the number of filler and bulk nodes, the excess
degree for filler nodes and for bulk nodes, and the starting index. (Technically, the
starting index can be recomputed from $\{n_d\}$, but it reduces the work to store these
indices explicitly. Likewise, we need not store both the bulk and filler counts so long
as we keep the total number of nodes per degree.) Hence, the total working storage
for Phase 2 information is $5 \cdot d_{\max}$ values.

See the algorithm in Appendix A for details. The total storage (including inputs)
needed by the generation routine is $10 \cdot d_{\max}$ values. It is possible to modify the core
data structures to store only the distinct degrees instead of maintaining a continuum
of degrees up to $d_{\max}$. This would change the storage requirement to $O(d_{\text{uniq}})$ instead
of $O(d_{\max})$, where $d_{\text{uniq}}$ is the number of distinct degrees in the graph. However, we
present our ideas based on $O(d_{\max})$ storage for clarity of presentation.

2.4. Phase 1. Phase 1 creates intra-block links. Each affinity block is modeled
as an Erdős-Rényi graph. An overwhelming majority of the triangles are formed in
this phase, and thus we pick the Erdős-Rényi constant, $p$, for the block to match
the target clustering coefficient $c$. A vertex of degree $d$ and clustering coefficient $c$
is incident to $c \cdot \binom{d}{2}$ triangles. Assume this vertex is grouped with other vertices of
degree $d$ into a block with $d + 1$ vertices, which holds for the great majority of the
vertices. If we build an Erdős-Rényi graph of this block with parameter $p$, then this
vertex is expected to be incident to $\binom{d}{2} p^3$ triangles. Solving for $p$ yields $p = \sqrt[3]{c}$.
Therefore, for block $b$, the connectivity is $p_b = \sqrt[3]{c_{d_b}}$, where $d_b$ denotes the minimum
degree in the block (since most blocks are homogeneous, this choice works well). Note
that the clustering coefficients of vertices will be higher if we only consider the affinity
blocks. This is to compensate for the edges that will be added in Phase 2 to increase
the number of wedges, likely without contributing any triangles.
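
The choice of connectivity can be written out as a short worked derivation. The following LaTeX block is a sketch under the same assumption as the text, namely a homogeneous block of $d + 1$ vertices; the symbol $v$ for the vertex under consideration is introduced here only for illustration.

```latex
% Vertex v has d potential neighbors inside its block of d + 1 vertices.
% Each of the \binom{d}{2} neighbor pairs forms a triangle with v
% only if all three edges are present, which happens with probability p^3.
\mathbb{E}[\text{triangles at } v] = \binom{d}{2} p^{3}
% Matching the target count c \binom{d}{2} gives
\binom{d}{2} p^{3} = c \binom{d}{2}
\quad\Longrightarrow\quad
p = \sqrt[3]{c}.
% Within the block alone, each wedge at v closes with probability p,
% and p = c^{1/3} \ge c for 0 \le c \le 1.
```

The last line makes explicit why the clustering coefficient measured inside the affinity blocks alone exceeds the target $c$ before the Phase 2 edges are added.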

The difficulty in Phase 1 is that we expect a preponderance of repeat edges in
the case where edges are generated independently with replacement. Consider affinity
block $b$ with $n_b$ nodes and connectivity $p_b$, meaning that each node in block $b$ wants
internal degree $p_b \cdot d_b$. BTER wants approximately $m_b = p_b \binom{n_b}{2}$ unique edges in block
$b$. Determining the number of draws with replacement to get a desired number of
unique items can be cast as a coupon collector problem. We can show that a good
approximation for the expected number of edges that need to be inserted is

$$w_b = \binom{n_b}{2} \ln\left(\frac{1}{1 - p_b}\right). \tag{2.1}$$

The proof is provided in Appendix B. 

We illustrate the utility of equation (2.1) with an example with $n_b = 10$ nodes and
connectivity $p_b = 0.5$, corresponding to $m_b = 22.5$ edges, on average. In this case, the
formula predicts that we need to do $w_b = 31.1916$ draws, in expectation, to see the
desired number of unique edges, in expectation. We do 10000 random experiments as
follows. For $i = 1, \ldots, 10000$, the random variable $X_i \sim \text{Poisson}(w_b)$ is the number of
items drawn from the $\binom{n_b}{2} = 45$ possible edges, and $Y_i$ is the number of those items
that are unique. A histogram of the $Y_i$ values is shown in Figure 2.3. The average
number of unique items exactly corresponds with the desired value.
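
The experiment can be reproduced with a short Python simulation. This is a sketch using only the standard library, so the Poisson sampler is implemented with Knuth's multiplication method (adequate for a mean near 31); the function names, the seed, and the default arguments are illustrative assumptions. With a Poisson number of uniform draws, each of the 45 possible edges appears at least once with probability $1 - e^{-w_b/45} = p_b$, so the expected number of unique edges is $45 \cdot p_b = 22.5$.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def mean_unique_edges(n_b=10, p_b=0.5, trials=10000, seed=1):
    """Return (w_b, average number of unique edges over the trials)."""
    rng = random.Random(seed)
    pairs = n_b * (n_b - 1) // 2                   # 45 possible edges
    w_b = pairs * math.log(1.0 / (1.0 - p_b))      # equation (2.1)
    total_unique = 0
    for _ in range(trials):
        draws = poisson(w_b, rng)                  # X_i
        seen = {rng.randrange(pairs) for _ in range(draws)}
        total_unique += len(seen)                  # Y_i
    return w_b, total_unique / trials


w_b, avg_unique = mean_unique_edges()
```

The returned `w_b` equals $45 \ln 2 \approx 31.1916$, and `avg_unique` should be close to the target of 22.5 up to sampling noise.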



Fig. 2.3: Distribution of the number of unique edges on 10,000 random trials
($n_b = 10$, $p_b = 0.5$, $w_b = 31.1916$; horizontal axis: unique items).



From equation (2.1), we can determine the number of extra edges needed in Phase
1. Specifically, we insert $w^{(1)} = \sum_b w_b$ edges to get a total of $m^{(1)} = \sum_b m_b$ unique
Phase 1 edges. Given that we are generating a Phase 1 edge, the process proceeds as
follows:

1. Pick an affinity group randomly proportional to the total weight of its
constituent blocks.

2. Pick a block within the affinity group uniformly at random (all blocks within 
the same group have the same weight). 

3. Pick two nodes without replacement uniformly at random from the selected 
block — these two nodes form an edge. 

The first step is a weighted sampling step and requires $O(\log g_{\max})$ work, where
$g_{\max} \le d_{\max}$ is the total number of affinity groups. The other two steps are constant
time operations.
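
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm of Appendix A: each affinity group is represented by the four stored values (start index, number of blocks, block size, block weight), the blocks of a group are assumed to occupy consecutive node indices, and the name `make_phase1_sampler` and the example weights are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def make_phase1_sampler(groups, seed=0):
    """groups: list of (start_index, num_blocks, block_size, block_weight)."""
    rng = random.Random(seed)
    # Total weight of a group = number of blocks times the weight per block.
    cum = list(accumulate(nb * w for _, nb, _, w in groups))

    def sample_edge():
        # Step 1: weighted choice of a group via binary search, O(log g_max).
        g = bisect_right(cum, rng.random() * cum[-1])
        start, num_blocks, size, _ = groups[g]
        # Step 2: uniform choice of a block inside the group.
        block = rng.randrange(num_blocks)
        first = start + block * size
        # Step 3: two distinct nodes of the block, uniformly at random.
        i, j = rng.sample(range(size), 2)
        return first + i, first + j

    return sample_edge


# Hypothetical groups: 7 blocks of size 3 (nodes 1-21), 3 blocks of size 4 (nodes 22-33).
sample_phase1 = make_phase1_sampler([(1, 7, 3, 1.0), (22, 3, 4, 2.0)])
u, v = sample_phase1()
```

Precomputing the cumulative group weights once keeps the per-edge cost at one binary search plus constant-time arithmetic.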

2.5. Phase 2. Phase 2 is simply a Chung-Lu model on the expected excess
degrees. In creating an edge, we choose two nodes independently. Those nodes are
chosen proportional to their excess degree. For node $i$ in group $b$, let
$w_i = \frac{1}{2} [d_i - (p_b \cdot d_b)]$ denote half its excess degree. The total number of edges that
should be inserted in Phase 2 is $w^{(2)} = \sum_i w_i$. Duplicate edges are fairly rare, so we
expect $m^{(2)} \approx w^{(2)}$ Phase 2 edges.

Let $n_d^{\text{fill}}$ and $n_d^{\text{bulk}}$ be the number of filler and bulk nodes of degree $d$, let
$w_d = \sum_{i \in V_d} w_i$ be the weight of all degree-$d$ nodes, and let $r_d$ be the proportion that are
filler nodes. Inserting a Phase 2 edge proceeds as follows:

1. Pick degree d proportional to Wd. 

2. Pick between filler and bulk nodes, according to r^. 

3. Pick a filler or bulk node (depending on the outcome of Step 2) of degree $d$
uniformly at random.

The first step is a weighted sampling step and requires $O(\log d_{\max})$ work, while the
other two steps are constant time operations.
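
The Phase 2 steps above can be sketched in Python as follows. Several details are assumptions made for illustration and are not taken from Appendix A: each degree is described by the five stored values in the order (start index, number of filler nodes, number of bulk nodes, filler excess degree, bulk excess degree), filler nodes of a degree are indexed before its bulk nodes, and $r_d$ is interpreted as the fraction of the weight $w_d$ carried by filler nodes so that nodes are chosen proportional to their excess degree.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def make_phase2_sampler(table, seed=0):
    """table: list of (start, n_fill, n_bulk, e_fill, e_bulk), one row per degree."""
    rng = random.Random(seed)
    w_fill = [nf * ef / 2 for _, nf, _, ef, _ in table]   # filler weight per degree
    w_bulk = [nb * eb / 2 for _, _, nb, _, eb in table]   # bulk weight per degree
    cum = list(accumulate(f + b for f, b in zip(w_fill, w_bulk)))

    def sample_node():
        # Step 1: pick a degree proportional to w_d, O(log d_max).
        k = bisect_right(cum, rng.random() * cum[-1])
        start, n_fill, n_bulk, _, _ = table[k]
        # Step 2: choose filler or bulk according to r_d (assumed weight share).
        r_d = w_fill[k] / (w_fill[k] + w_bulk[k])
        if rng.random() < r_d:
            # Step 3: uniform filler node of this degree.
            return start + rng.randrange(n_fill)
        # Step 3: uniform bulk node of this degree.
        return start + n_fill + rng.randrange(n_bulk)

    def sample_edge():
        return sample_node(), sample_node()   # endpoints chosen independently

    return sample_edge


# Hypothetical rows: degree 2 (nodes 1-20, all bulk), degree 3 (node 21 filler, 22-30 bulk).
sample_phase2 = make_phase2_sampler([(1, 0, 20, 0.0, 0.5), (21, 1, 9, 1.2, 0.8)])
u, v = sample_phase2()
```

Degrees with zero total weight occupy a zero-width interval in the cumulative array and are never selected, so the division that computes `r_d` is only evaluated for degrees with positive weight.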

One complication in Phase 2 is that getting the correct number of degree-1 nodes
poses a problem: approximately 36% of the pool of potential degree-1 nodes will not
be selected and another 28% will have degree 2 or larger. A fix for this problem has
been proposed in [14], which involves increasing the pool of degree-1 nodes without
changing the expected number of edges that will be connected to these vertices.
This increase in the pool size is controlled by the "blowup factor," $\beta \ge 1$. This is
included in the algorithm described in Appendix A.

2.6. Independent Edge Generation. Lastly, we pull everything together to 
explain the independent edge generation. We insert a total of $w = w^{(1)} + w^{(2)}$ edges,
flipping a weighted coin for each edge to determine if it is Phase 1 or Phase 2. We
expect to generate a total of $m = m^{(1)} + m^{(2)}$ edges.

Generating the edges is extremely inexpensive: $O(\log d_{\max})$ per edge. The expensive
step is de-duplication, where extra copies of repeated edges are removed. The
same difficulty exists for the current Graph 500 (SKG) benchmark. Some may argue
that duplicate edges are a useful feature since real data also has duplicates, but it is 
not clear that the duplication rates are similar to those observed in real data. 
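
The overall loop can be sketched in Python as follows. The function name `generate_edges`, its arguments, and the stub samplers in the usage example are hypothetical; storing edges in a set stands in for the de-duplication step, and discarding self-loops is an assumption of this sketch.

```python
import random


def generate_edges(w1, w2, sample_phase1, sample_phase2, seed=0):
    """Insert about w1 + w2 edges, choosing the phase with a weighted coin."""
    rng = random.Random(seed)
    p_phase1 = w1 / (w1 + w2)          # probability that an edge is Phase 1
    edges = set()
    for _ in range(round(w1 + w2)):
        if rng.random() < p_phase1:
            u, v = sample_phase1()
        else:
            u, v = sample_phase2()
        if u != v:                     # assumption: self-loops are discarded
            edges.add((min(u, v), max(u, v)))   # the set removes duplicates
    return edges


# Usage with stub samplers that stand in for the Phase 1 and Phase 2 routines.
stub_rng = random.Random(42)
result = generate_edges(
    30.0, 20.0,
    lambda: tuple(stub_rng.sample(range(1, 4), 2)),          # one 3-node block
    lambda: (stub_rng.randrange(1, 11), stub_rng.randrange(1, 11)),
)
```

Since every iteration is independent of the others, the loop can be split across workers with separate random streams, leaving de-duplication as the only step that needs to see all edges.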

2.7. Implementations. We provide the algorithm in Appendix A. We have a 
reference implementation in MATLAB that will be made available in the future. We 
have also implemented the method in Hadoop and use this in some of our experiments. 
The implementation is straightforward: we divide the work of creating the
$w$ edges evenly among a user-specified number of mappers. Each edge is hashed, and that
hash is the key for the reduce phase. The reducers remove any duplicate edges and
emit a final list of all edges.

3. Numerical Comparisons. 

3.1. Small data. In Figure 3.1, we present comparisons of BTER with the
state-of-the-art in scalable generative models: SKG (current Graph 500 generator)
[8, 23, 42] and CL [1, 10, 11, 14] on two small data sets available from SNAP [44].
We treat all edges as undirected and remove any duplicate edges and loops. The
graph ca-AstroPh is a collaboration network based on 124 months of data from the
astrophysics section of the arXiv preprint server; it has 9,987 nodes and 25,973 edges.
The graph soc-Epinions1 is a who-trusts-whom online social (review) network from
the Epinions website with 75,879 nodes and 405,740 edges.

The parameters of SKG are from [23], obtained from their KronFit algorithm. 
The input to CL is the degree distribution of the real graph and a blowup factor of 
10 (same as is used for BTER for better matching degree-1 vertices [14]). The inputs 
to BTER are the degree distribution and clustering coefficients per degree (computed 
exactly) of the real graph and a blowup factor of 10. Timings are not reported as 
they are negligible for all three methods. 

Degree distribution. The degree distributions for the original graphs and the models
are shown in Figures 3.1a and 3.1d. SKG is known to have oscillations in the degree
distribution [39, 37], and these oscillations are easily visible in Figure 3.1d. The
oscillations are correctable with appropriate addition of noise [39, 37] (not shown), but
even then it tends to overestimate the low degree nodes and miss the highest degree
nodes. In contrast, both CL and BTER closely match the real data.

Clustering coefficients. The clustering coefficients per degree for the original
graphs and the models are shown in Figures 3.1b and 3.1e. The SKG graph model
has no inherent mechanism for closing triangles and creating community structure.
Though a few triangles may close at random, they are insufficient for the SKG-generated
graph to match the clustering coefficients in the real data. The situation
for CL is similar to that for SKG: there is no reason for wedges to close. BTER,
on the other hand, provides a much closer match to the real data.

Eigenvalues of adjacency matrix. We show the top 50 leading eigenvalues (in
magnitude) of the adjacency matrix in Figures 3.1c and 3.1f. BTER is a much closer
match to the real data, especially for the first few eigenvalues. Under certain circumstances,
matching the degree distribution should produce a match in eigenvalues [25].
However, we conjecture that graphs with community structure require that triangle
structure also be matched to obtain a good fit for the eigenvalues.

3.2. Large data. We demonstrate that BTER is able to fit large-scale real-world
data. We do not compare to SKG because it is not possible to fit the parameters for 
such large graphs. We do not compare to the CL model because we can easily explain 
the performance: its match in terms of the degree distribution is nearly identical to 
that of BTER, and its clustering coefficients are close to zero, as for the small data. 
The data sets are described in Table 3.1a. We treat all edges as undirected and remove 
any duplicate edges and loops. We obtained real-world graphs from the Laboratory 
for Web Algorithms [43], which compressed the graphs using LLP and WebGraph 
[7, 5]. Briefly, the networks are described as follows.

• amazon-2008 [7, 5]: A graph describing similarity among books as reported 
by the Amazon store. 

Fig. 3.1: Comparison of CL, SKG, and BTER on small graphs. Panels: (a) degree distribution,
(b) clustering coefficients, and (c) leading adjacency matrix eigenvalues for ca-HepTh;
(d) degree distribution, (e) clustering coefficients, and (f) leading adjacency matrix
eigenvalues for soc-Epinions1.



• ljournal-2008 [9, 7, 5]: Nodes represent users on LiveJournal. Node $x$ connects
to node $y$ if $x$ registered $y$ as a friend.

• hollywood-2011 [7, 5]: This is a graph of actors. Two actors are joined by an
edge whenever they appear in a movie together.

• twitter-2010 [22, 7, 5]: Nodes are Twitter users, and node $x$ links to node $y$
if $y$ follows $x$.

• uk-union-2006-06-2007-05 (shortened to uk-union) [6, 7, 5]: Links between web
pages on the .uk domain. We ignore the time labeling on the links.

The smaller graphs (amazon-2008, ljournal-2008, hollywood-2011) are those with
up to roughly 100M edges. These can be easily processed using MATLAB on an SGI
Altix UV 10 with 32 cores (4 Xeon 8-core 2.0GHz processors) and 512 GB DDR3
memory. None of the parallel capabilities of MATLAB are enabled for these studies.
To give a sense of the memory requirements, storing the hollywood-2011 graph as a
sparse matrix in MATLAB requires 3.4GB of storage. The larger graphs (twitter-2010,
uk-union) each have over 1B edges. These are processed on a Hadoop cluster
with 32 compute nodes. Each compute node has an Intel i7 930 CPU at 2.8GHz (4
physical cores, HyperThreading enabled), 12 GB of memory, and four 2TB SATA disks.
All experiments were run using Apache Hadoop version 0.20.203.0 (see footnote 1).

Table 3.1: Network characteristics of original and BTER-generated graphs.

(a) Large data set properties.

Graph            |V|     |E|       d_max       d_avg   GCC
amazon-2008      1M      4M        1,077       10      0.260
ljournal-2008    5M      50M       19,432      18      0.124
hollywood-2011   2M      114M      13,107      115     0.175
twitter-2010     40M     1,202M    2,997,487   60      0.001
uk-union         122M    4,659M    6,366,528   76      0.007

(b) Properties of BTER-generated graphs, including generation and edge deduplication time.

Graph            |V|     |E|       d_max       d_avg   GCC     Gen.       Dedup.
amazon-2008      1M      4M        1,052       10      0.253   2.27s      9.25s
ljournal-2008    5M      49M       18,510      19      0.127   33.81s     126.40s
hollywood-2011   2M      114M      11,676      115     0.180   88.54s     362.25s
twitter-2010     38M     1,133M    1,635,823   59      0.004   222.87s (Gen. + Dedup.)
uk-union         120M    4,399M    1,497,950   73      0.111   1638.28s (Gen. + Dedup.)

The inputs to BTER are the degree distribution and clustering coefficients by 
degree. (We used a blowup of $\beta = 1$ for the experiments reported here.) Computing 
the degree distribution is straightforward. However, for the clustering coefficient 
calculations, we used the sampling approach as implemented in [20] with 2000 samples 
per degree, so the expected error is $\epsilon = 0.05$ at a confidence level of 99.9%. 
Sampling was not required for the smaller graphs, but we have used it in all 
experiments for consistency. 
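
The quoted error and confidence level can be sanity-checked with a Hoeffding-type bound on the mean of bounded samples. The sketch below assumes that bound applies to the per-degree estimates; the exact analysis in [20] is not reproduced here, and the function names are illustrative only.

```python
import math

def hoeffding_error(num_samples: int, confidence: float) -> float:
    """Error eps such that the mean of num_samples independent values in [0, 1]
    deviates from its expectation by eps or more with probability at most
    1 - confidence (two-sided Hoeffding bound)."""
    delta = 1.0 - confidence
    return math.sqrt(math.log(2.0 / delta) / (2.0 * num_samples))

def hoeffding_samples(eps: float, confidence: float) -> int:
    """Smallest sample count for which the bound above guarantees error eps."""
    delta = 1.0 - confidence
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

eps_2000 = hoeffding_error(2000, 0.999)      # error guaranteed by 2000 samples
k_needed = hoeffding_samples(0.05, 0.999)    # samples sufficient for eps = 0.05

print(f"error bound with 2000 samples: {eps_2000:.4f}")
print(f"samples sufficient for eps = 0.05: {k_needed}")
```

Under this assumption, 2000 samples per degree is more than enough for an error of 0.05 at 99.9% confidence.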

BTER Timing. Table 3.1b shows the details and timings for the graphs produced 
by BTER. Observe the close match in the characteristics of the graphs in terms of 
number of nodes, number of edges, maximum degree, average degree, and global clus- 
tering coefficient. For the smaller graphs, we are able to separate the edge generation 
and deduplication time. The generation time is not strictly proportional to the num- 
ber of desired edges because we have to generate extra edges for Phase 1 to account for 
possible duplicates (see §2.4). The parallelism of Hadoop yields a large advantage in 
terms of time. The twitter-2010 graph has 10 times more edges than hollywood-2011, 
but it takes less than half the time to do the computation on the 32-node Hadoop 
cluster. 

Degree Distribution. Figure 3.2 illustrates the match between the real data and 
the BTER graph. BTER cannot easily match discontinuities in the degree distribution 
because of the randomness in creating edges. The issue is that nodes generally do not 
get exactly the desired degree — it may be one or two more or less. For a smooth degree 
distribution, neighboring degrees cancel the effect on one another. For discontinuous 
distributions, the BTER degree distribution is a "smoothed" version. This is evident, 
for instance, in the amazon-2008 data where we can see a smoothing effect on the 
sharp discontinuity near degree 10. 

¹ http://hadoop.apache.org/ 




Fig. 3.2: Degree distributions of original and BTER-generated graphs. Panels: 
(a) amazon-2008, (b) ljournal-2008, (c) hollywood-2011, (d) twitter-2010, 
(e) uk-union. Horizontal axis: degree. 



Clustering Coefficients. BTER's strength is its ability to match clustering 
coefficients and therefore community structure. Most degree distributions are heavy 
tailed and have a relatively consistent structure. The same is not true for clustering 
coefficients. Different profiles can potentially lead to graphs with fundamentally dif- 
ferent structures. Figure 3.3 shows the clustering coefficients of the real data and the 
BTER-generated graphs. There is a very close match. 

4. Proposed Benchmark. So far we have shown that BTER can be used to 
match real-world data. For benchmarking purposes however, reasonable "ideal" pro- 
files for degree distribution and clustering coefficient by degree are required. In this 
section, we propose some possibilities, noting that these can easily be changed for 
whatever scenario a user may encounter. 

Fig. 3.3: Clustering coefficients of original and BTER-generated graphs. Panels: 
(a) amazon-2008, (b) ljournal-2008, (c) hollywood-2011, (d) twitter-2010, 
(e) uk-union. 



4.1. Idealized Degree Distribution. It has been hypothesized that the degree 
distribution of real-world networks follows a power law (PL), i.e., 

$$ n_d \propto d^{-\gamma}, $$ 

for some parameter $\gamma$ [2]. However, our observation is that power law distributions 
are difficult to use as a model, a point that is discussed in more detail below. It has 
also been suggested that power laws are not necessarily the best descriptors for 
real-world networks [35, 4]. Finally, proving (in a statistical sense) that a single 
observed degree distribution is power law is difficult [12]. 

For benchmarking purposes, our goal is to specify an ideal average degree, $\bar d$, and 
an absolute bound on maximum degree, $d^*$. Let $f(d)$ define the desired proportionality 
of degree $d$. We then create a discrete distribution on $d = 1, \dots, d^*$ as 

$$ \Pr(D = d) = \frac{f(d)}{\sum_{d'=1}^{d^*} f(d')}. $$ 

Ideally, the average degree is equal to $\bar d$ and the probability of having degree $d^*$ 
is sufficiently small, i.e., 

$$ \bar d = \sum_{d=1}^{d^*} d \cdot \Pr(D = d) \quad \text{and} \quad \Pr(D = d^*) < \epsilon_{\rm tol}, $$ 

where $\epsilon_{\rm tol}$ is small enough such that $n \cdot \epsilon_{\rm tol} \ll 1$ (where $n$ is the number 
of nodes). For the power law distribution, it can be difficult to find a value for 
$\gamma$ that yields a high enough average degree and a low enough probability of 
choosing $d^*$. Hence, we propose instead a generalized log-normal (GLN) distribution, i.e., 

$$ f(d) \propto \exp\left( -\left( \frac{\log d}{\alpha} \right)^{\beta} \right) $$ 

for some parameters $\alpha$ and $\beta$. Additionally, the shape of the distribution is typical of 
the real-world graphs shown in §3. 

We consider two scenarios, both with n = 10^ nodes. We do a parameter search 
on $\alpha$ and $\beta$ (fminsearch in MATLAB) to locate the optimal parameters. A function 
for finding the optimal parameters for either discrete GLN or discrete PL for 
user-specified values of $d_{\rm avg}$ and $d_{\max}$ is included in the reference code to be 
released at a future date. 

Scenario 1 for Degree Distribution Fitting. In the first scenario, the targets are 
$\bar d = 16$ and $d^*$ = 10^. For discrete PL, the optimal parameter is $\gamma = 1.911$ with 
$d_{\rm avg} = 16$ and $\Pr(D = d^*)$ = 1.97 x 10^-. For discrete GLN, the optimal parameters 
are $\alpha = 1.988$ and $\beta = 2.079$ with $d_{\rm avg} = 16$ and $\Pr(D = d^*)$ = 4..U x 10^-. 
Realizations of the two distributions are pictured in Figure 4.1a. For this scenario, 
both degree distributions are reasonable in that there is no sharp drop off as we get 
close to the maximum allowable degree, $d^*$. 

Scenario 2 for Degree Distribution Fitting. In the second scenario, the targets are 
$\bar d = 64$ and $d^*$ = 10^. For PL, the optimal parameter is $\gamma = 1.668$ with $d_{\rm avg} = 64$ but 
$\Pr(D = d^*)$ = 2.16 x 10^- (fairly large). For discrete GLN, the optimal parameters 
are $\alpha = 2.171$ and $\beta = 1.877$ with $d_{\rm avg} = 64$ and $\Pr(D = d^*)$ = 8.35 x 10^-. 
Realizations of the two distributions are pictured in Figure 4.1b. In this scenario, the 
problem with power law becomes apparent: near $d^*$, there are still many degrees 
with multiple nodes, so the cutoff is extremely abrupt. In comparison, discrete 
GLN fades more naturally to the desired maximum degree. 

4.2. Idealized Clustering Coefficients. As there is no definitive structure 
to clustering coefficients, we propose a simple parameterized curve that has some 
similarity to real data observations. 

Let $\{n_d\}$ define the specified degree distribution and $d_{\max}$ be the maximum 
degree such that $n_d > 0$. We define $\bar c_d$, the mean value for $c_d$, as 

$$ \bar c_d = c_{\max} \exp(-(d - 1) \cdot \xi) \quad \text{for } d \ge 2, $$ 

where $c_{\max}$ and $\xi$ are parameters. If $c_{\max}$ is specified, then a simple parameter search 
can be used to fit $\xi$ to a target global clustering coefficient; code to fit the data is 
included in the reference code. The final values for $\{c_d\}$ are selected as 

$$ c_d \sim N(\bar c_d, \min\{10^{-}, \bar c_d / 2\}). $$ 

The randomness could, of course, be omitted. 

Fig. 4.1: Example degree distributions from discrete power law (DPL) and discrete 
generalized log normal (DGLN) for n = 10^ nodes. Panels: (a) Scenario 1: $\bar d = 16$ 
and $d^*$ = 10^; (b) Scenario 2: $\bar d = 64$ and $d^*$ = 10^. 
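
The parameter search for the clustering coefficient profile of §4.2 can be sketched as follows. This is a minimal illustration, not the reference code: it assumes the exponential profile $\bar c_d = c_{\max} \exp(-(d-1)\xi)$, omits the random perturbation, uses bisection in place of whatever search the authors used, and runs on a small hypothetical degree distribution. It relies on the fact that the global clustering coefficient is the wedge-weighted average of the per-degree means, since a node of degree d is the center of d(d-1)/2 wedges.

```python
import math

def cc_profile(c_max, xi, d_max):
    """Mean clustering coefficient per degree; index d holds c_bar_d (0 for d < 2)."""
    return [0.0, 0.0] + [c_max * math.exp(-(d - 1) * xi) for d in range(2, d_max + 1)]

def global_cc(n_d, c_bar):
    """Wedge-weighted average of per-degree coefficients.
    n_d[d] is the number of nodes of degree d."""
    wedges = [n * d * (d - 1) / 2 for d, n in enumerate(n_d)]
    return sum(w * c for w, c in zip(wedges, c_bar)) / sum(wedges)

def fit_xi(n_d, c_max, target_gcc, tol=1e-10):
    """Bisection on xi; the global coefficient decreases as xi grows."""
    d_max = len(n_d) - 1
    lo, hi = 0.0, 1.0
    while global_cc(n_d, cc_profile(c_max, hi, d_max)) > target_gcc:
        hi *= 2.0                      # expand until the target is bracketed
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if global_cc(n_d, cc_profile(c_max, mid, d_max)) > target_gcc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical toy degree distribution (not from the paper): n_d ~ 1000 / d^2.
n_d = [0] + [round(1000 / d**2) for d in range(1, 51)]
xi = fit_xi(n_d, c_max=0.5, target_gcc=0.10)
achieved = global_cc(n_d, cc_profile(0.5, xi, len(n_d) - 1))
print(f"xi = {xi:.6f}, achieved GCC = {achieved:.6f}")
```

The target must lie below $c_{\max}$ for the search to succeed, because the profile never exceeds $c_{\max}$.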



4.3. Example Graphs. We generate two example graphs per the scenarios be- 
low. Table 4.1 lists the network characteristics and Figure 4.2 shows the target and 
BTER-generated degree distributions and clustering coefficients. 

Table 4.1: Network characteristics of BTER-generated graphs for benchmarking. 

Graph     |V|    |E|     d_max     d_avg   GCC     Gen.      Dedup. 
Ideal 1   1M     35M     28,643    72      0.406   35.11s    117.18s 
Ideal 2   1M     8M      2,594     17      0.104   5.07s     20.66s 

Scenario 1. For the first set-up, we selected $\bar d$ = 7b and $d^* = 100{,}000$ to define the 
degree distribution. The parameter search selected $\alpha = 2.14$ and $\beta = 1.83$. For the 
clustering coefficients, we set $c_{\max} = 0.9$ and targeted a GCC of 0.15. The parameter 
search selected $\xi$ = 3.59 x 10^- for defining the clustering coefficient profile. 

Scenario 2. For the second set-up, we selected $\bar d = 16$ and $d^* = 10{,}000$ to define 
the degree distribution. The parameter search selected $\alpha = 1.98$ and $\beta = 2.08$. For the 
clustering coefficients, we set $c_{\max} = 0.5$ and targeted a GCC of 0.10. The parameter 
search selected $\xi = 0.01$ for defining the clustering coefficient profile. 
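
To make the Scenario 2 degree distribution concrete, the sketch below builds the discrete distribution on d = 1, ..., d* and reports its average degree and tail probability. It assumes the generalized log-normal proportionality $f(d) \propto \exp(-(\log d / \alpha)^\beta)$ and uses the rounded parameters quoted above, so the average is only approximately 16. The power-law exponent is included purely for comparison and was not fitted for this d*; all function names are illustrative.

```python
import math

def discrete_pmf(f, d_star):
    """Normalize f over d = 1..d_star; entry k of the result is Pr(D = k + 1)."""
    weights = [f(d) for d in range(1, d_star + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def gln(alpha, beta):
    """Assumed generalized log-normal proportionality."""
    return lambda d: math.exp(-((math.log(d) / alpha) ** beta))

def power_law(gamma):
    return lambda d: d ** (-gamma)

def average_degree(pmf):
    return sum(d * p for d, p in enumerate(pmf, start=1))

d_star = 10_000                                    # Scenario 2 bound on the maximum degree
pmf_gln = discrete_pmf(gln(1.98, 2.08), d_star)    # rounded alpha, beta from the text
pmf_pl = discrete_pmf(power_law(1.911), d_star)    # comparison only, not fitted for this d_star

avg_gln = average_degree(pmf_gln)
print(f"GLN: average degree {avg_gln:.2f}, Pr(D = d*) = {pmf_gln[-1]:.3e}")
print(f"PL : average degree {average_degree(pmf_pl):.2f}, Pr(D = d*) = {pmf_pl[-1]:.3e}")
```

A fitting routine would wrap `average_degree` and the tail probability in an objective and hand it to a derivative-free optimizer, which is the role fminsearch plays in the MATLAB workflow described in §4.1.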

5. Conclusions. This paper demonstrates that the BTER generative model is 
appropriate for modeling massive networks. We provide a detailed algorithm along 
with analysis explaining the workings of the method. The original paper on BTER 
[36] provided none of the implementation details and, in fact, did not directly use the 
clustering coefficient data but rather estimated it via a function. Here we give precise 
details on the implementation, which is nontrivial due to issues such as repeat edges. 
We are able to build a model of a graph with 120M nodes and 4.7B edges in less than 
30 minutes on a 32-node Hadoop cluster. 

The development of a realistic graph model is an important step in developing 
effective "null" models that nonetheless share the properties of real-world networks. 
Such models will be useful in detecting anomalies, statistical sampling, and community 
detection. For example, the BTER model does not have larger communities beyond 
the affinity blocks, whereas we might expect that real-world graphs have a richer 
structure such as a hierarchy or other complex behavior. 

[Figure 4.2 panels: (a) Degree distribution for Scenario 1; (b) Degree distribution 
for Scenario 2.] 

We believe the proposed BTER model, along with the proposed degree and clus- 
tering coefficient distributions, will boost benchmarking efforts in graph processing 
by providing realistic graph instances. The proposed degree distributions capture 
the essence of degree distributions that we see in practice and generate realistic dis- 
tributions even at large scales (whereas power law has a reputation of generating a 
few degrees that are much larger than observed in practice). Moreover, the proposed 
distribution allows us to modify both the average and the maximum degree, which is 
critical for benchmarking. The proposed clustering coefficient curves implicitly 
embed triangle structure into the graphs, which is a critical feature that distinguishes 
real graphs from arbitrary sparse graphs. And finally, the proposed generation 
algorithm enables generating extremely large graphs thanks to its parallel edge 
generation property. Altogether, we believe the proposed model will fill a much-needed 
gap in benchmarking graph processing. 

Of course, we only consider the case of a static, unattributed, undirected network. 
Future work will be aimed at developing models that capture dynamics, attributes, 
and direction, but even capturing the degree distribution in a directed graph poses 
challenges [14]. 

Appendix A. BTER algorithm details. 

For convenience, notation is described in Table A.1. The node- and block-level 
variables are not used in the algorithms. 

A.1. Preprocessing. The BTER Setup procedure is described in Algorithm 1. 
The inputs are the degree distribution, $\{n_d\}$; the clustering coefficients per degree, 
$\{c_d\}$; and the blowup factor for degree-1 nodes, $\beta$. 

The method precomputes the index for the first node of each degree, $\{i_d\}$, and 
the number of nodes with degree greater than the degree $d$, $\{n'_d\}$. The latter is not 
saved. 

The degree-1 nodes are handled specially. All degree-1 nodes are arbitrarily 
assigned to be "fill" nodes. The number of candidate degree-1 nodes may be increased 
using the blowup factor, $\beta$. However, if the blowup is used, the majority of the 
(desired) degree-1 nodes will ultimately have degree 0 and can be removed in 
post-processing. 

The main loop walks through each degree, determining the information for Phases 
1 and 2. 

It first allocates degree-$d$ nodes to be fill nodes for the last incomplete block, 
if needed. The number of nodes necessary to complete the last incomplete block is 
denoted by $n_*^{\rm fill}$. The excess degree of any fill nodes depends on the internal degree of 
the last incomplete block, denoted by $d_*$. The excess degree is used to determine the 
weight of the degree-$d$ fill nodes for phase 2, $w_d^{\rm fill}$. 

If more nodes of degree $d$ remain, these are allocated as bulk nodes and a new 
group is formed. The number of bulk nodes of degree $d$ is denoted by $n_d^{\rm bulk}$. For 
the new group, we determine the index of the first node, the number of blocks, and 
the size of each block. The very last block of the very last group is special because 
the remaining nodes may not be enough to fill it. For simplicity and because it is 
often the case for heavy-tailed networks, we assume the last group contains only one 
block. This makes handling it as a special case simpler. We compute excess degree for 
these bulk nodes and then the corresponding weight of degree-$d$ bulk nodes for phase 
2, $w_d^{\rm bulk}$. We also compute the weight of the group, $w_g$, using the coupon collector 
over-sampling weight described in §2.4. Finally, we compute the number of nodes 
needed to fill out the last block, $n_*^{\rm fill}$. 

Rather than returning $w_d^{\rm fill}$ and $w_d^{\rm bulk}$ directly, it is easier (for the edge generation 
phase) to have their sum, $w_d$, and the ratio of fill nodes, $r_d$. Likewise, we do not 
return $n_d^{\rm bulk}$ since it can be easily recomputed using $n_d$ and $n_d^{\rm fill}$. We do return $i_d$, 
but this could be omitted and recomputed if that were more efficient (e.g., reducing 
communication to workers). Finally, we no longer need to keep $\{c_d\}$ after the 
preprocessing is complete. 

Algorithm 1 BTER Setup 

procedure BTER_SETUP($\{n_d\}$, $\{c_d\}$, $\beta$) 
    ▷ Number nodes from least degree to greatest, except degree-1 nodes are last 
    $i_2 \gets 1$ 
    for $d = 3, \dots, d_{\max}$ do 
        $i_d \gets i_{d-1} + n_{d-1}$ 
    end for 
    ▷ Compute number of nodes with degree greater than $d$ 
    for $d = 1, \dots, d_{\max}$ do 
        $n'_d \gets \sum_{d' > d} n_{d'}$ 
    end for 
    ▷ Handle degree-1 nodes 
    $n_1^{\rm fill} \gets \beta \cdot n_1$, $w_1 \gets \frac{1}{2} n_1$, $r_1 \gets 1$ 
    ▷ Main loop 
    $g \gets 0$, $n_*^{\rm fill} \gets 0$, $d_* \gets 0$ 
    for $d = 2, \dots, d_{\max}$ do 
        if $n_*^{\rm fill} > 0$ then                ▷ Try to fill incomplete block from current group 
            $n_d^{\rm fill} \gets \min(n_*^{\rm fill}, n_d)$ 
            $n_*^{\rm fill} \gets n_*^{\rm fill} - n_d^{\rm fill}$ 
            $w_d^{\rm fill} \gets \frac{1}{2} n_d^{\rm fill} (d - d_*)$ 
        else 
            $n_d^{\rm fill} \gets 0$, $w_d^{\rm fill} \gets 0$ 
        end if 
        $n_d^{\rm bulk} \gets n_d - n_d^{\rm fill}$ 
        if $n_d^{\rm bulk} > 0$ then                ▷ Create a new group for degree-$d$ bulk nodes 
            $g \gets g + 1$ 
            $i_g \gets i_d + n_d^{\rm fill}$ 
            $b_g \gets \lceil n_d^{\rm bulk} / (d + 1) \rceil$ 
            $n_g \gets d + 1$ 
            if $b_g \cdot (d + 1) > (n'_d + n_d^{\rm bulk})$ then    ▷ Special handling of last group 
                if $b_g \ne 1$ then throw error 
                $n_g \gets (n'_d + n_d^{\rm bulk})$ 
            end if 
            $\rho_* \gets \sqrt[3]{c_d}$ 
            $d_* \gets (n_g - 1) \cdot \rho_*$ 
            $w_d^{\rm bulk} \gets \frac{1}{2} n_d^{\rm bulk} (d - d_*)$ 
            $w_g \gets b_g \cdot \frac{1}{2} n_g (n_g - 1) \cdot \log(1 / (1 - \rho_*))$ 
            $n_*^{\rm fill} \gets (b_g \cdot n_g) - n_d^{\rm bulk}$ 
        else 
            $w_d^{\rm bulk} \gets 0$ 
        end if 
        $w_d \gets w_d^{\rm fill} + w_d^{\rm bulk}$, $r_d \gets w_d^{\rm fill} / w_d$ 
    end for 
    return $\{i_d\}$, $\{w_d\}$, $\{r_d\}$, $\{n_d^{\rm fill}\}$, $\{w_g\}$, $\{i_g\}$, $\{b_g\}$, $\{n_g\}$ 

Table A.1: Description of variables. A ✓ symbol in the second column indicates that 
this variable is explicitly computed and stored for use by the sampling procedure; a 
✗ symbol means the variable is not explicitly computed. The last column gives the 
formula to derive it from the stored values. 

Symbol            Stored  Description                                        Formula 

Scalars 
$n$                       Total number of nodes                              $n = \sum_d n_d$ 
$m$                       Total number of edges                              $m = \frac{1}{2} \sum_d d \cdot n_d$ 
$w$                       Total number of edges to insert                    $w = w^{(1)} + w^{(2)}$ 
$d_{\max}$        ✓       Largest (desired) degree 
$b_{\max}$        ✗       Total number of affinity blocks                    $b_{\max} = \sum_g b_g$ 
$g_{\max}$        ✓       Total number of affinity groups                    $g_{\max} \le d_{\max}$ 
$\beta$           ✓       Blowup factor for degree-1 vertices 

Node Level, $i = 1, \dots, n$ 
$d_i$             ✗       Degree (desired) of node $i$ 
$b_i$             ✗       Block id for node $i$ 
$w_i$             ✗       Excess degree of node $i$                          $w_i = \frac{1}{2} [d_i - (\rho_{b_i} \cdot d_{b_i})]$ 

Block Level, $b = 1, \dots, b_{\max}$ 
$d_b$             ✗       Minimum degree in block $b$ 
$\rho_b$          ✗       Connectivity of block $b$                          $\rho_b = \sqrt[3]{c_{d_b}}$ 
$n_b$             ✗       Number of nodes in block $b$                       $n_b = d_b + 1$ 
$m_b$             ✗       Number of unique edges in block $b$                $m_b = \rho_b \binom{n_b}{2}$ 
$w_b$             ✗       Weight of block $b$                                $w_b = \binom{n_b}{2} \ln(1 / (1 - \rho_b))$ 

Degree Level, $d = 1, \dots, d_{\max}$ 
$n_d$             ✓       Number of nodes of degree $d$ 
$c_d$             ✓       Mean clustering coefficient for nodes of degree $d$ 
$i_d$             ✓       Index of first node of degree $d$ 
$n'_d$                    Number of nodes of degree greater than $d$         $n'_d = \sum_{d' > d} n_{d'}$ 
$n_d^{\rm fill}$  ✓       Number of fill nodes of degree $d$ 
$n_d^{\rm bulk}$          Number of bulk nodes of degree $d$                 $n_d^{\rm bulk} = n_d - n_d^{\rm fill}$ 
$w_d$             ✓       Excess degree of nodes of degree $d$ 
$r_d$             ✓       Ratio of fill excess degree for degree $d$ 
$w_d^{\rm fill}$          Excess degree of fill nodes of degree $d$          $w_d^{\rm fill} = r_d \cdot w_d$ 
$w_d^{\rm bulk}$          Excess degree of bulk nodes of degree $d$          $w_d^{\rm bulk} = w_d - w_d^{\rm fill}$ 

Group Level, $g = 1, \dots, g_{\max}$ 
$i_g$             ✓       Index of first node in group $g$ 
$b_g$             ✓       Number of affinity blocks in group $g$ 
$n_g$             ✓       Number of nodes per block in group $g$ 
$w_g$             ✓       Weight of group $g$ (including duplicate edges) 

Phase Level, $k = 1, 2$ 
$w^{(k)}$                 Weight of phase $k$                                $w^{(1)} = \sum_g w_g$, $w^{(2)} = \sum_d w_d$ 

A.2. Generating Edges. BTER edge generation is shown in Algorithm 2. The 
procedure Random_Sample does a weighted sampling according to a specified 
discrete distribution. For $p$ bins, the cost is $O(\log(p))$. For each edge, we randomly 
select between the phases using a weighted coin. A Phase 1 edge requires one sample 
from a discrete distribution of size $g_{\max}$ and three additional random values drawn 
uniformly from [0, 1]. A Phase 2 edge requires two samples from a discrete 
distribution of size $d_{\max}$ and four additional random values drawn uniformly from [0, 1]. 
Since $g_{\max} \le d_{\max}$, an upper bound on the cost per edge is the cost of one discrete random 
sample on a distribution of size $d_{\max}$ plus four random values drawn uniformly from 
[0, 1]. 

Algorithm 2 BTER Sample 

procedure BTER_SAMPLE($\{n_d\}$, $\{i_d\}$, $\{w_d\}$, $\{r_d\}$, $\{n_d^{\rm fill}\}$, $\{w_g\}$, $\{i_g\}$, $\{b_g\}$, $\{n_g\}$) 
    $w^{(1)} \gets \sum_g w_g$, $w^{(2)} \gets \sum_d w_d$, $w \gets w^{(1)} + w^{(2)}$ 
    $E^{(1)} \gets \emptyset$, $E^{(2)} \gets \emptyset$ 
    for $j = 1, \dots, w$ do 
        $r \sim U[0, 1]$ 
        if $r < w^{(1)} / w$ then 
            $E^{(1)} \gets E^{(1)} \cup$ BTER_SAMPLE_PHASE1($\{w_g\}$, $\{i_g\}$, $\{b_g\}$, $\{n_g\}$) 
        else 
            $E^{(2)} \gets E^{(2)} \cup$ BTER_SAMPLE_PHASE2($\{w_d\}$, $\{r_d\}$, $\{n_d\}$, $\{n_d^{\rm fill}\}$, $\{i_d\}$) 
        end if 
    end for 
    return $E^{(1)}$, $E^{(2)}$ 

procedure BTER_SAMPLE_PHASE1($\{w_g\}$, $\{i_g\}$, $\{b_g\}$, $\{n_g\}$) 
    $g \gets$ RANDOM_SAMPLE($\{w_g\}$)                                    ▷ Choose group 
    $r_1 \sim U[0, 1]$, $\delta \gets i_g + \lfloor r_1 \cdot b_g \rfloor \cdot n_g$         ▷ Choose block and compute its offset 
    $r_2 \sim U[0, 1]$, $i \gets \lfloor r_2 \cdot n_g \rfloor + \delta$                   ▷ Choose 1st node 
    $r_3 \sim U[0, 1]$, $j \gets \lfloor r_3 \cdot (n_g - 1) \rfloor + \delta$             ▷ Choose 2nd node 
    if $j \ge i$ then 
        $j \gets j + 1$ 
    end if 
    return $(i, j)$ 

procedure BTER_SAMPLE_PHASE2($\{w_d\}$, $\{r_d\}$, $\{n_d\}$, $\{n_d^{\rm fill}\}$, $\{i_d\}$) 
    $i \gets$ BTER_SAMPLE_PHASE2_NODE($\{w_d\}$, $\{r_d\}$, $\{n_d\}$, $\{n_d^{\rm fill}\}$, $\{i_d\}$) 
    $j \gets$ BTER_SAMPLE_PHASE2_NODE($\{w_d\}$, $\{r_d\}$, $\{n_d\}$, $\{n_d^{\rm fill}\}$, $\{i_d\}$) 
    return $(i, j)$ 

procedure BTER_SAMPLE_PHASE2_NODE($\{w_d\}$, $\{r_d\}$, $\{n_d\}$, $\{n_d^{\rm fill}\}$, $\{i_d\}$) 
    $d \gets$ RANDOM_SAMPLE($\{w_d\}$)                                    ▷ Choose degree 
    $r_1 \sim U[0, 1]$, $r_2 \sim U[0, 1]$ 
    if $r_1 < r_d$ then 
        $i \gets \lfloor r_2 \cdot n_d^{\rm fill} \rfloor + i_d$                             ▷ Fill node 
    else 
        $i \gets \lfloor r_2 \cdot (n_d - n_d^{\rm fill}) \rfloor + (i_d + n_d^{\rm fill})$      ▷ Bulk node 
    end if 
    return $i$ 



In Algorithm 2, we generate each edge independently. It may also be possible 
to "bulk" the computations by first determining the total number of edges for each 
phase and then performing the computation for each phase separately. Within each 
phase, the procedure itself can be easily vectorized to boost runtime performance, as in 
MATLAB. 
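
As an illustration of the Phase 1 step, the following sketch draws a single within-block edge. It is not the reference implementation: it uses 0-based node indices, realizes the weighted sample by binary search over cumulative weights (one way to obtain the logarithmic cost per draw), and runs on a small hypothetical set of groups.

```python
import bisect
import itertools
import random

def random_sample(cum_weights, rng):
    """Weighted draw of a bin index via binary search on cumulative weights."""
    r = rng.random() * cum_weights[-1]
    return min(bisect.bisect_right(cum_weights, r), len(cum_weights) - 1)

def sample_phase1_edge(cum_w, i_g, b_g, n_g, rng):
    """Draw one Phase 1 edge: both endpoints lie in the same affinity block."""
    g = random_sample(cum_w, rng)                  # choose group
    block = int(rng.random() * b_g[g])             # choose block within the group
    offset = i_g[g] + block * n_g[g]               # index of the block's first node
    i = int(rng.random() * n_g[g]) + offset        # first endpoint
    j = int(rng.random() * (n_g[g] - 1)) + offset  # second endpoint among the other n_g - 1 nodes
    if j >= i:
        j += 1                                     # skip over i so that j != i
    return i, j

# Hypothetical toy setup: group 0 has 3 blocks of 4 nodes (nodes 0..11),
# group 1 has 2 blocks of 5 nodes (nodes 12..21).
i_g, b_g, n_g = [0, 12], [3, 2], [4, 5]
w_g = [3.0, 7.0]
cum_w = list(itertools.accumulate(w_g))

rng = random.Random(0)
edges = [sample_phase1_edge(cum_w, i_g, b_g, n_g, rng) for _ in range(5)]
print(edges)
```

Because each call is independent of every other, the loop over edges can be split across workers or vectorized without coordination; duplicates are removed afterwards.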

A.3. Edge Deduplication. Any method can be used for deduplication. In 
general, the simplest procedure is to hash the edges in such a way that $(i, j)$ and 
$(j, i)$ hash to the same key. Then it is easy enough to sort each bucket to remove 
duplicates. In a parallel environment, since we are hashing by edge and not vertex, 
there should not be load balancing problems. In fact, hashing by a single endpoint is 
not recommended because of the heavy-tailed nature of the graph. 
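
A minimal single-machine sketch of this idea follows. It assumes edges arrive as pairs of integer node ids; the helper names and the bucket count are illustrative, and a hash set stands in for the sort-within-bucket step described above.

```python
def canonical(edge):
    """Order the endpoints so that (i, j) and (j, i) map to the same key."""
    i, j = edge
    return (i, j) if i <= j else (j, i)

def bucket_of(edge, num_buckets):
    """Assign an edge to a bucket using both endpoints, not a single vertex."""
    return hash(canonical(edge)) % num_buckets

def dedup_edges(edges):
    """Return the distinct undirected edges in first-seen order."""
    seen = set()
    unique = []
    for edge in edges:
        key = canonical(edge)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

edges = [(1, 2), (2, 1), (3, 4), (1, 2), (4, 3), (5, 6)]
unique_edges = dedup_edges(edges)
print(unique_edges)
```

In a distributed setting, `bucket_of` would decide which worker receives an edge, so both orientations of an edge always meet in the same bucket.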



Appendix B. Coupon Collector Derivation. 

Consider a universe $U$ of objects/coupons, and suppose we pick objects 
uniformly at random with replacement from $U$. The following theorem proves the desired 
bound, when $U$ is the set of possible pairs in an affinity block (so $|U| = \binom{n_b}{2}$). 

Theorem B.1. For a given $\rho \in (0, 1)$, the expected number of independent draws 
required to select $\rho|U|$ distinct coupons from $U$ is $|U| \ln(1/(1 - \rho)) + O(1)$. 

Proof. For convenience, we assume that $\rho|U|$ is an integer. Consider a sequence 
of draws. Let $X_i$ (for integer $0 \le i < \rho|U|$) be the random variable denoting the 
number of draws required to get one more (distinct) coupon after $i$ distinct coupons 
have been collected. Observe that the quantity of interest is $E[\sum_{i < \rho|U|} X_i]$, which by 
linearity of expectation is $\sum_{i < \rho|U|} E[X_i]$. 

When $i$ distinct coupons have already been collected, the probability that a single 
draw gives a new coupon is exactly $1 - i/|U|$. Think of this as the probability of 
"success." The number of draws required for a success (new coupon) follows a geometric 
distribution (Chap VI.8 of [16]) and the mean of this is $1/(1 - i/|U|) = |U|/(|U| - i)$. 
Using this mean, the expected total number of draws can be expressed as follows: 

$$ \sum_{i < \rho|U|} \frac{|U|}{|U| - i} = |U| \sum_{j = (1-\rho)|U| + 1}^{|U|} \frac{1}{j} = |U| \left( \ln|U| - \ln((1 - \rho)|U|) + O(1/|U|) \right) = |U| \ln\frac{1}{1 - \rho} + O(1). $$ 

(We use the standard bound for the Harmonic sum: $\sum_{i \le r} 1/i = \ln r + \gamma + O(1/r)$, 
where $\gamma$ is the Euler-Mascheroni constant.) □ 
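
The bound can be checked numerically by comparing the exact expectation, the sum of $|U|/(|U| - i)$ over the coupons already collected, with the logarithmic expression. The sketch below does this for a hypothetical block of 101 nodes and a target fraction of one half; both values are chosen only for illustration.

```python
import math

def expected_draws(universe_size, rho):
    """Exact expected number of draws to collect rho * |U| distinct coupons."""
    k = round(rho * universe_size)
    return sum(universe_size / (universe_size - i) for i in range(k))

def log_approximation(universe_size, rho):
    """Leading term |U| * ln(1 / (1 - rho)) from the coupon collector bound."""
    return universe_size * math.log(1.0 / (1.0 - rho))

universe = math.comb(101, 2)          # number of node pairs in a block of 101 nodes
exact = expected_draws(universe, 0.5)
approx = log_approximation(universe, 0.5)
print(f"|U| = {universe}, exact = {exact:.3f}, approximation = {approx:.3f}")
```

The gap between the two values stays bounded as the universe grows for a fixed target fraction, but the constant grows as the fraction approaches 1.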
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